AITopics | codebook size

Collaborating Authors

codebook size

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Scaling the Codebook Size of VQ-GAN to 100,000 with a Utilization Rate of 99%

Neural Information Processing SystemsMar-18-2026, 11:49:18 GMT

In the realm of image quantization exemplified by VQGAN, the process encodes images into discrete tokens drawn from a codebook with a predefined size. Recent advancements, particularly with LLAMA 3, reveal that enlarging the codebook significantly enhances model performance. However, VQGAN and its derivatives, such as VQGAN-FC (Factorized Codes) and VQGAN-EMA, continue to grapple with challenges related to expanding the codebook size and enhancing codebook utilization. For instance, VQGAN-FC is restricted to learning a codebook with a maximum size of 16,384, maintaining a typically low utilization rate of less than 12% on ImageNet. In this work, we propose a novel image quantization model named VQGAN-LC (Large Codebook), which extends the codebook size to 100,000, achieving an utilization rate exceeding 99%. Unlike previous methods that optimize each codebook entry, our approach begins with a codebook initialized with 100,000 features extracted by a pre-trained vision encoder. Optimization then focuses on training a projector that aligns the entire codebook with the feature distributions of the encoder in VQGAN-LC. We demonstrate the superior performance of our model over its counterparts across a variety of tasks, including image reconstruction, image classification, auto-regressive image generation using GPT, and image creation with diffusion-and flow-based generative models.

artificial intelligence, machine learning, natural language, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (0.96)
Information Technology > Artificial Intelligence > Natural Language (0.82)
Information Technology > Sensing and Signal Processing > Image Processing (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.59)

Add feedback

1716d022edeac750e57a2986a7135e13-Paper-Conference.pdf

Neural Information Processing SystemsFeb-8-2026, 12:45:29 GMT

codebook, codebook size, vqgan-lc, (15 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Vector Quantization using Gaussian Variational Autoencoder

Xu, Tongda, Zheng, Wendi, He, Jiajun, Hernandez-Lobato, Jose Miguel, Wang, Yan, Zhang, Ya-Qin, Tang, Jie

arXiv.org Artificial IntelligenceDec-9-2025

V ector quantized variational autoencoder (VQ-V AE) is a discrete auto-encoder that compresses images into discrete tokens. It is difficult to train due to dis-cretization. In this paper, we propose a simple yet effective technique, dubbed Gaussian Quant (GQ), that converts a Gaussian V AE with certain constraint into a VQ-V AE without training. GQ generates random Gaussian noise as a code-book and finds the closest noise to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian V AE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian V AE for effective GQ, named target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-V AEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves upon previous Gaussian V AE discretization methods, such as TokenBridge. V ector-quantized variational autoencoder (V an Den Oord et al., 2017) is an autoencoder that compresses images into discrete tokens. It is fundamental to autoregressive generative models (Esser et al., 2021; Chang et al., 2022; Y u et al., 2023; Sun et al., 2024b). However, VQ-V AE is difficult to train: the encoding process of VQ-V AE is not differentiable and challenges such as codebook collapse often emerge (Sønderby et al., 2017).

artificial intelligence, machine learning, semanticscholar, (17 more...)

arXiv.org Artificial Intelligence

2512.06609

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Unveiling the Impact of Data and Model Scaling on High-Level Control for Humanoid Robots

Wei, Yuxi, Wang, Zirui, Yin, Kangning, Hu, Yue, Wang, Jingbo, Chen, Siheng

arXiv.org Artificial IntelligenceDec-9-2025

Abstract-- Data scaling has long remained a critical bottleneck in robot learning. For humanoid robots, human videos and motion data are abundant and widely available, offering a free and large-scale data source. Besides, the semantics related to the motions enable modality alignment and high-level robot control learning. However, how to effectively mine raw video, extract robot-learnable representations, and leverage them for scalable learning remains an open problem. T o address this, we introduce Humanoid-Union, a large-scale dataset generated through an autonomous pipeline, comprising over 260 hours of diverse, high-quality humanoid robot motion data with semantic annotations derived from human motion videos. The dataset can be further expanded via the same pipeline. Building on this data resource, we propose SCHUR, a scalable learning framework designed to explore the impact of large-scale data on high-level control in humanoid robots. Experimental results demonstrate that SCHUR achieves high robot motion generation quality and strong text-motion alignment under data and model scaling, with 37% reconstruction improvement under MPJPE and 25% alignment improvement under FID comparing with previous methods. Its effectiveness is further validated through deployment in real-world humanoid robot.

artificial intelligence, representation, robot motion, (14 more...)

arXiv.org Artificial Intelligence

2511.09241

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Robots > Humanoid Robots (1.00)

Add feedback

Meta-Learning Multi-armed Bandits for Beam Tracking in 5G and 6G Networks

Mattick, Alexander, Yammine, George, Kontes, Georgios, Maghsudi, Setareh, Mutschler, Christopher

arXiv.org Artificial IntelligenceDec-8-2025

Beamforming-capable antenna arrays with many elements enable higher data rates in next generation 5G and 6G networks. In current practice, analog beamforming uses a codebook of pre-configured beams with each of them radiating towards a specific direction, and a beam management function continuously selects \textit{optimal} beams for moving user equipments (UEs). However, large codebooks and effects caused by reflections or blockages of beams make an optimal beam selection challenging. In contrast to previous work and standardization efforts that opt for supervised learning to train classifiers to predict the next best beam based on previously selected beams we formulate the problem as a partially observable Markov decision process (POMDP) and model the environment as the codebook itself. At each time step, we select a candidate beam conditioned on the belief state of the unobservable optimal beam and previously probed beams. This frames the beam selection problem as an online search procedure that locates the moving optimal beam. In contrast to previous work, our method handles new or unforeseen trajectories and changes in the physical environment, and outperforms previous work by orders of magnitude.

artificial intelligence, codebook, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2512.0568

Country: Europe > Germany (0.14)

Genre: Research Report (1.00)

Industry:

Telecommunications (0.68)
Energy (0.46)
Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

RAVQ-HoloNet: Rate-Adaptive Vector-Quantized Hologram Compression

Rafiei, Shima, Babak, Zahra Nabizadeh Shahr, Samavi, Shadrokh, Shirani, Shahram

arXiv.org Artificial IntelligenceNov-27-2025

Holography offers significant potential for AR/VR applications, yet its adoption is limited by the high demands of data compression. Existing deep learning approaches generally lack rate adaptivity within a single network. We present RAVQ-HoloNet, a rate-adaptive vector quantization framework that achieves high-fidelity reconstructions at low and ultra-low bit rates, outperforming current state-of-the-art methods. In low bit, our method exceeds by -33.91% in BD-Rate and achieves a BD-PSNR of 1.02 dB from the best existing method demonstrated by the rate-distortion curve.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.21035

Country: North America > Canada > Ontario (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

1716d022edeac750e57a2986a7135e13-Paper-Conference.pdf

Neural Information Processing SystemsOct-9-2025, 19:27:59 GMT

codebook, codebook size, vqgan-lc, (15 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Quantize-then-Rectify: Efficient VQ-VAE Training

Zhang, Borui, Rao, Qihang, Zheng, Wenzhao, Zhou, Jie, Lu, Jiwen

arXiv.org Artificial IntelligenceJul-15-2025

Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present \textbf{Quantize-then-Rectify (ReVQ)}, a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating \textbf{channel multi-group quantization} to enlarge codebook capacity and a \textbf{post rectifier} to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.

artificial intelligence, machine learning, quantization error, (17 more...)

arXiv.org Artificial Intelligence

2507.10547

Genre:

Research Report > New Finding (0.34)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.69)

Add feedback

Task-Driven Discrete Representation Learning

Vuong, Tung-Long

arXiv.org Artificial IntelligenceJun-16-2025

In recent years, deep discrete representation learning (DRL) has achieved significant success across various domains. Most DRL frameworks (e.g., the widely used VQ-VAE and its variants) have primarily focused on generative settings, where the quality of a representation is implicitly gauged by the fidelity of its generation. In fact, the goodness of a discrete representation remain ambiguously defined across the literature. In this work, we adopt a practical approach that examines DRL from a task-driven perspective. We propose a unified framework that explores the usefulness of discrete features in relation to downstream tasks, with generation naturally viewed as one possible application. In this context, the properties of discrete representations as well as the way they benefit certain tasks are also relatively understudied. We therefore provide an additional theoretical analysis of the trade-off between representational capacity and sample complexity, shedding light on how discrete representation utilization impacts task performance. Finally, we demonstrate the flexibility and effectiveness of our framework across diverse applications.

artificial intelligence, machine learning, representation, (15 more...)

arXiv.org Artificial Intelligence

2506.11511

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Spoken Language Modeling with Duration-Penalized Self-Supervised Units

Visser, Nicol, Kamper, Herman

arXiv.org Artificial IntelligenceMay-30-2025

Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.

artificial intelligence, chatbot, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.23494

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.63)

Add feedback